Basics and Bag-Of-Words
Please make sure your R and RStudio environment is up to date, as we will make use of recent R features (such as the native pipe operator |>) and recent documentation packages (Quarto).
These are the new packages that we will need today:
install.packages(c('quarto', 'stm', 'stminsights', 'textdata',
                   'quanteda', 'quanteda.textstats',
                   'quanteda.textmodels', 'quanteda.textplots',
                   'tidymodels'))

Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as data: A new framework for machine learning and the social sciences. Princeton University Press.
Computers store text in the form of numbers: for each character, e.g. a, a character encoding defines a corresponding sequence of numbers.
Today, many different standards for text encoding exist (e.g. ASCII, UTF-8, UTF-16, Latin-1, ...). If you read textual data and declare the wrong encoding, some characters will not be parsed correctly:
all_good <- readLines('data/encoding_issues_utf8.txt',
                      encoding = 'UTF-8')
all_good
[1] "NLP rocks"                 "Éncôdíng_cäuses_headaçhe$"
suffering <- readLines('data/encoding_issues_utf8.txt',
                       encoding = 'latin1')
suffering
[1] "NLP rocks"                 "ÉncôdÃng_cäuses_headaçhe$"
UTF-8 is commonly used and is the default encoding for many R packages (e.g. tidyverse).

“We destroy language to turn it into data” - Ken Benoit, IC2S2 2019
“Make love, not war.” -> c('make', 'love', 'not', 'war')
tokens define semantic elements of strings; important for many applications of text analysis (e.g. counting)
predominantly separated by white space and punctuation marks
converting text to tokens raises several questions, e.g. how to handle punctuation, casing, or multi-word expressions
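As a minimal illustration of one such choice (assuming quanteda is installed), punctuation handling alone changes the resulting tokens:

```r
library(quanteda)

txt <- "Make love, not war."
toks_punct <- tokens(txt)                       # punctuation kept as separate tokens
toks_clean <- tokens(txt, remove_punct = TRUE)  # "," and "." dropped
as.character(toks_clean)                        # "Make" "love" "not" "war"
```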
library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(lubridate)
df <- read_csv('data/donors_choose_sample.csv')
df |> pull(project_title) |> head(3)
[1] "Books for Brains"
[2] "Softball Teams Need Gloves and Ball Part 3"
[3] "Reading and Math Games for Sneaky Learning (They'll Never Know!)"
cat(df$project_essay[1])
A typical day in our room consists of a lot of questions coming from interested minds. Our students' brains are like sponges at this age, soaking up every bit of information they can. However some times there are underlying factors that students struggle with on a daily basis that clouds their learning.
We have really amazing students at our school and I couldn't be more proud of their thirst for learning!
Our school is located in intercity urban San Antonio. Due to the low socioeconomic status of our school's community, our students often come in with more on their mind than learning fractions. Domestic violence, homelessness, and hunger are many things that our students face when they go home.
At our school, we take pride in our students and truly care about how each student is feeling. We go out of our way to provide different services such as: counselors, snacks, plenty of hugs, etc., to handle each situation so that our students' can come right back and perform at their best. <!--DONOTREMOVEESSAYDIVIDER-->I am hoping to re-incorporate rich literature into my students daily school life. The students at my school are so plugged in, consumed by their devices, they sometimes miss out on broadening their imagination.
We live in a world being consumed by technology.
The books I am requesting coincide with a genre study, a study on a fantastic children's author, and so much more. I want to fill my library with high interest books to put in student hands alongside their technological devices. They will have the best of both worlds! I have a passion for children's literature that I hope to inspire my students with!
We use a regular expression to clean up the donation texts:
# remove noise
df$project_essay <- str_replace_all(df$project_essay,
                                    pattern = '<!--DONOTREMOVEESSAYDIVIDER-->',
                                    replacement = '\n\n')
# validate
str_detect(df$project_essay[1],
           '<!--DONOTREMOVEESSAYDIVIDER-->')
[1] FALSE
Alternative R packages for text analysis: tidytext, text2vec

You can create a quanteda corpus from (1) a character vector or (2) a data frame; the latter automatically includes metadata as document variables:
donor_corp <- corpus(df, text_field = 'project_essay',
docid_field = 'project_id')
docvars(donor_corp)$text <- df$project_essay # store unprocessed text
ndoc(donor_corp) # no. of documents
[1] 10000
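For completeness, a corpus can also be built directly from a character vector (option 1 above); a minimal sketch with made-up texts:

```r
library(quanteda)

txts <- c(doc1 = "Text analysis is fun.",
          doc2 = "Bag-of-words models ignore word order.")
small_corp <- corpus(txts)
ndoc(small_corp)     # 2
docnames(small_corp) # "doc1" "doc2"
```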
Tokens can be created from a corpus or character vector. The documentation (?tokens) illustrates several options, e.g. for the removal of punctuation:
donor_tokens <- tokens(donor_corp,
remove_numbers = TRUE) # removing digits
donor_tokens[[1]][1:20] # text 1, first 20 tokens
 [1] "A"          "typical"    "day"        "in"         "our"
 [6] "room"       "consists"   "of"         "a"          "lot"
[11] "of"         "questions"  "coming"     "from"       "interested"
[16] "minds"      "."          "Our"        "students"   "'"
Corpus objects can be used to discover keywords in context (KWIC):
kwic_donor <- kwic(donor_tokens, pattern = c("ipad"),
window = 5) # context window
head(kwic_donor, 3)
Keyword-in-context with 3 matches.
 [54bea65a2cadc0f79f367fb1b76d6cfc, 15]      is very important. The | iPad | will also help my students
 [54bea65a2cadc0f79f367fb1b76d6cfc, 29]  use advanced technology. An | iPad | would enhance my students'
 [54bea65a2cadc0f79f367fb1b76d6cfc, 188]     computer, much less an | iPad | . The use of an
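Multi-token patterns can also be matched by wrapping them in phrase(); a small self-contained sketch (the example sentence is made up):

```r
library(quanteda)

toks <- tokens("Students use the iPad in science class every day.")
kwic(toks, pattern = phrase("science class"), window = 2)
```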
Stemming reduces related word forms to a common stem (e.g. love, loving):

words <- c("love", "loving", "lovingly",
           "loved", "lover", "lovely")
char_wordstem(words, 'english')
[1] "love"  "love"  "love"  "love"  "lover" "love"
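Stemming can also be applied to a whole tokens object via tokens_wordstem(); a minimal sketch with a made-up sentence:

```r
library(quanteda)

toks <- tokens("The teachers loved teaching their loving students")
tokens_wordstem(toks, language = 'english')
# e.g. "teachers" -> "teacher", "loved" and "loving" -> "love"
```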
Multiple preprocessing steps can be chained via the pipe operator, e.g. normalizing to lowercase and removing common English stopwords:
donor_tokens <- donor_tokens |>
tokens_tolower() |>
tokens_remove(stopwords('english'),
padding = TRUE) # keep empty strings
donor_tokens[[1]][1:10]
 [1] ""         "typical"  "day"      ""         ""         "room"
 [7] "consists" ""         ""         "lot"
Some concepts span multiple tokens (e.g. United States). Frequently co-occurring token sequences (collocations) can be detected and merged into single tokens:

colls <- textstat_collocations(donor_tokens,
                               min_count = 200) # minimum frequency
donor_tokens_c <- tokens_compound(donor_tokens, colls) |>
tokens_remove('') # remove empty strings
donor_tokens_c[[1]][1:5] # first five tokens of first text
[1] "typical_day" "room"        "consists"    "lot"         "questions"
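The same detect-then-compound workflow can be sketched on toy data (the sentences below are assumptions; min_count is lowered to fit the tiny sample):

```r
library(quanteda)
library(quanteda.textstats)

toks <- tokens(c("New York is big.",
                 "I love New York.",
                 "New York never sleeps."))
colls <- textstat_collocations(toks, min_count = 2) # detect frequent pairs
toks_c <- tokens_compound(toks, colls)              # join them with "_"
as.character(toks_c[[2]])                           # "New_York" is now one token
```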
A document-feature matrix (DFM) records how often each feature occurs in each document:

| doc_id | I | like | hate | currywurst |
|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 1 | 1 |
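The toy table above can be reproduced with quanteda (a sketch; the two example sentences are assumptions, and note that dfm() lowercases features by default):

```r
library(quanteda)

toy <- c("I like currywurst", "I hate currywurst")
toy_dfm <- dfm(tokens(toy)) # features: "i" "like" "currywurst" "hate"
as.matrix(toy_dfm)
```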
Quanteda can create DFMs from tokens and other DFM objects:
dfm_donor <- dfm(donor_tokens_c)
dim(dfm_donor)
[1] 10000 28041
As an alternative (or complement) to manually defining stopwords, terms occurring in either very few or almost all documents can be removed automatically:
dfm_donor <- dfm_donor |>
dfm_keep(min_nchar = 2) |> # remove terms < 2 characters
dfm_trim(min_docfreq = 0.001, # 0.1% min
max_docfreq = 0.50, # 50% max
docfreq_type = 'prop') # proportions instead of counts
dim(dfm_donor)
[1] 10000  6826
textplot_wordcloud(dfm_donor, max_words = 100, color = 'black')

textstat_frequency(dfm_donor) |> head(10)
      feature frequency rank docfreq group
1 reading 8276 1 3453 all
2 many 7938 2 4918 all
3 able 7933 3 4703 all
4 use 7804 4 4656 all
5 class 7299 5 4333 all
6 work 7266 6 4282 all
7 need 7178 7 4487 all
8 books 6916 8 2379 all
9 technology 5979 9 2578 all
10 love 5921 10 3773 all
Features can also be created from sequences of n tokens (n-grams).
text <- "to be or not to be"
tokens(text) |> tokens_ngrams(1:2) # unigrams + bigrams
Tokens consisting of 1 document.
text1 :
[1] "to" "be" "or" "not" "to" "be" "to_be" "be_or"
[9] "or_not" "not_to" "to_be"
tokens(text) |> tokens_ngrams(3) # trigrams only
Tokens consisting of 1 document.
text1 :
[1] "to_be_or" "be_or_not" "or_not_to" "not_to_be"
Raw counts can be weighted, e.g. by tf-idf (term frequency-inverse document frequency); see dfm_tfidf() for the quanteda implementation.
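A minimal sketch of tf-idf weighting on a toy DFM (the example texts are made up; dfm_tfidf() uses inverse document frequency with base-10 logarithms by default):

```r
library(quanteda)

toy_dfm <- dfm(tokens(c("cats chase mice",
                        "cats sleep",
                        "dogs chase cats")))
# "cats" occurs in every document, so its idf (and hence its weight) is 0
dfm_tfidf(toy_dfm)
```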
For more advanced NLP in R, see spacyr (a Python wrapper for spaCy) and udpipe.

Questions?